464 research outputs found

    S3Aug: Segmentation, Sampling, and Shift for Action Recognition

    Action recognition is a well-established area of research in computer vision. In this paper, we propose S3Aug, a video data augmentation method for action recognition. Unlike conventional video data augmentation methods that cut and paste regions from two videos, the proposed method generates new videos from a single training video through segmentation and label-to-image transformation. Furthermore, the proposed method modifies certain categories of the label images by sampling to generate a variety of videos, and shifts intermediate features to enhance the temporal coherency between frames of the generated videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets demonstrate the effectiveness of the proposed method, particularly for the out-of-context videos of the Mimetics dataset.
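    The generation pipeline described in the abstract (segment each frame, resample some label categories, then run a label-to-image model) can be sketched roughly as below. This is only an illustrative reading: segmenter, generator, and the category-resampling rule are hypothetical placeholders, not the authors' models, and the intermediate feature-shift step is omitted.

    import torch

    def s3aug_like(video, segmenter, generator, resample_ids, num_classes):
        """video: (T, C, H, W) tensor; returns one synthesized video."""
        out_frames = []
        for frame in video:                                   # per-frame processing
            labels = segmenter(frame.unsqueeze(0)).argmax(1)  # (1, H, W) label map
            for cid in resample_ids:                          # resample selected categories
                labels[labels == cid] = torch.randint(num_classes, (1,)).item()
            out_frames.append(generator(labels))              # label-to-image transformation
        return torch.cat(out_frames, dim=0)                   # (T, C, H, W) generated video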

    Joint learning of images and videos with a single Vision Transformer

    In this study, we propose a method for jointly learning images and videos with a single model. In general, images and videos are trained with separate models. We propose a Vision Transformer, IV-ViT, that takes a batch of images as input and also a set of video frames, which are temporally aggregated by late fusion. Experimental results on two image datasets and two action recognition datasets are presented. Comment: MVA2023 (18th International Conference on Machine Vision Applications), Hamamatsu, Japan, 23-25 July 2023.
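    A minimal sketch of the joint setup as described, assuming a generic per-frame ViT backbone shared between the image and video paths, with late fusion realized as a mean over frame features. Class and attribute names are illustrative, not the authors' IV-ViT code.

    import torch
    import torch.nn as nn

    class JointImageVideoModel(nn.Module):
        def __init__(self, backbone, feat_dim, num_classes):
            super().__init__()
            self.backbone = backbone                   # shared per-frame ViT feature extractor
            self.head = nn.Linear(feat_dim, num_classes)

        def forward_image(self, images):               # images: (B, C, H, W)
            return self.head(self.backbone(images))

        def forward_video(self, clips):                # clips: (B, T, C, H, W)
            b, t = clips.shape[:2]
            feats = self.backbone(clips.flatten(0, 1)) # (B*T, D) per-frame features
            feats = feats.view(b, t, -1).mean(dim=1)   # late fusion over time
            return self.head(feats)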

    Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

    We propose Multi-head Self/Cross-Attention (MSCA), which introduces a temporal cross-attention mechanism for action recognition, based on the structure of the Multi-head Self-Attention (MSA) mechanism of the Vision Transformer (ViT). Simply applying ViT to each frame of a video captures frame features but cannot model temporal features, while modeling temporal information with a CNN or Transformer is computationally expensive. TSM, which performs feature shifting, assumes a CNN backbone and cannot take advantage of the ViT structure. The proposed model captures temporal information by shifting the Query, Key, and Value in the MSA computation of ViT. This is efficient, adds no extra computational cost, and is a suitable structure for extending ViT along the temporal dimension. Experiments on Kinetics400 show the effectiveness of the proposed method and its superiority over previous methods. Comment: 9 pages.
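    One possible reading of the Query/Key/Value shift is sketched below: a fraction of the Key/Value heads is rolled by one frame along time before the attention product, so those heads attend across adjacent frames at no extra cost. The function name, the head-wise split, and the wrap-around roll are assumptions for illustration, not the authors' implementation.

    import torch

    def temporally_shifted_attention(q, k, v, shift_frac=0.25):
        """q, k, v: (B, T, H, N, Dh) per-frame multi-head tensors."""
        n = int(k.shape[2] * shift_frac)          # number of heads to shift
        # roll the first n heads of K/V one frame along time (wrap-around kept
        # for brevity; a TSM-style shift would zero-pad instead)
        k = torch.cat([torch.roll(k[:, :, :n], 1, dims=1), k[:, :, n:]], dim=2)
        v = torch.cat([torch.roll(v[:, :, :n], 1, dims=1), v[:, :, n:]], dim=2)
        attn = torch.einsum("bthnd,bthmd->bthnm", q, k) / q.shape[-1] ** 0.5
        return torch.einsum("bthnm,bthmd->bthnd", attn.softmax(dim=-1), v)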

    A Study on Methods for Extracting Object and Person Regions from Images (画像中の物体および人物領域の抽出手法に関する研究)

    Nagoya University (名古屋大学), Doctor of Engineering, doctoral thesis

    New Teaching Materials and Instruction Methods for Programming Practice (プログラミング実習における新しい教材とその指導方法)

    FY2004 (Heisei 16) Annual Conference on Engineering and Technological Education Research (工学・工業教育研究講演会), slides; venue: Kanazawa Institute of Technology, Ishikawa Prefecture; date: July 2004